Hash-based systems and methods for detecting and preventing transmission of unwanted e-mail

ABSTRACT

A system (120) detects transmission of potentially unwanted e-mail messages. The system (120) may receive e-mail messages and generate hash values based on one or more portions of the e-mail messages. The system (120) may then determine whether the generated hash values match hash values associated with prior e-mail messages. The system (120) may determine that one of the e-mail messages is a potentially unwanted e-mail message when one or more of the generated hash values associated with the e-mail message match one or more of the hash values associated with the prior e-mail messages.

RELATED APPLICATIONS

This application is a divisional of U.S. patent application Ser. No. 10/654,771, filed Sep. 4, 2003, which, in turn, claims priority under 35 U.S.C. § 119 based on U.S. Provisional Application No. 60/407,975, filed Sep. 5, 2002, both of which are incorporated herein by reference. U.S. patent application Ser. No. 10/654,771 is also a continuation-in-part of U.S. patent application Ser. No. 10/251,403, filed Sep. 20, 2002, which claims priority under 35 U.S.C. § 119 based on U.S. Provisional Application No. 60/341,462, filed Dec. 14, 2001, both of which are incorporated herein by reference. U.S. patent application Ser. No. 10/654,771 is also a continuation-in-part of U.S. patent application Ser. No. 09/881,145, and U.S. patent application Ser. No. 09/881,074, both of which were filed on Jun. 14, 2001, and both of which claim priority under 35 U.S.C. § 119 based on U.S. Provisional Application No. 60/212,425, filed Jun. 19, 2000, all of which are incorporated herein by reference.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to network security and, more particularly, to systems and methods for detecting and/or preventing the transmission of unwanted e-mails, such as e-mails containing worms and viruses, including polymorphic worms and viruses, and unsolicited commercial e-mails.

2. Description of Related Art

Availability of low cost computers, high speed networking products, and readily available network connections has helped fuel the proliferation of the Internet. This proliferation has caused the Internet to become an essential tool for both the business community and private individuals. Dependence on the Internet arises, in part, because the Internet makes it possible for multitudes of users to access vast amounts of information and perform remote transactions expeditiously and efficiently. Along with the rapid growth of the Internet have come problems arising from attacks from within the network and the sheer volume of commercial e-mail. As the size of the Internet continues to grow, so does the threat posed to users of the Internet.

Many of the problems take the form of e-mail. Viruses and worms often masquerade within e-mail messages for execution by unsuspecting e-mail recipients. Unsolicited commercial e-mail, or “spam,” is another burdensome type of e-mail because it wastes both the time and resources of the e-mail recipient.

Existing techniques for detecting viruses, worms, and spam examine each e-mail message individually. In the case of viruses and worms, this typically means examining attachments for byte-strings found in known viruses and worms (possibly after uncompressing or de-archiving attached files), or simulating execution of the attachment in a “safe” compartment and examining its behaviors. Similarly, existing spam filters usually examine a single e-mail message looking for heuristic traits commonly found in unsolicited commercial e-mail, such as an abundance of Uniform Resource Locators (URLs), heavy use of all-capital-letter words, use of colored text or large fonts, and the like, and then “score” the message based on the number and types of such traits found. Both the anti-virus and the anti-spam techniques can demand significant processing of each message, adding to the resource burden imposed by unwanted e-mail. Neither technique makes use of information collected from other recent messages.

Thus, there is a need for an efficient technique that can quickly detect viruses, worms, and spam in e-mail messages arriving at e-mail servers, possibly by using information contained in multiple recent messages to detect unwanted mail more quickly and efficiently.

SUMMARY OF THE INVENTION

Systems and methods consistent with the present invention address this and other needs by providing a new defense that detects and prevents the transmission of unwanted (and potentially unwanted) e-mail, such as e-mails containing viruses, worms, and spam.

In accordance with an aspect of the invention as embodied and broadly described herein, a method for detecting transmission of potentially unwanted e-mail messages is provided. The method includes receiving e-mail messages and generating hash values based on one or more portions of the e-mail messages. The method further includes determining whether the generated hash values match hash values associated with prior e-mail messages. The method may also include determining that one of the e-mail messages is a potentially unwanted e-mail message when one or more of the generated hash values associated with the e-mail message match one or more of the hash values associated with the prior e-mail messages.

In accordance with another aspect of the invention, a mail server includes one or more hash memories and a hash processor. The one or more hash memories is/are configured to store count values associated with hash values. The hash processor is configured to receive an e-mail message, hash one or more portions of the e-mail message to generate hash values, and increment the count values corresponding to the generated hash values. The hash processor is further configured to determine whether the e-mail message is a potentially unwanted e-mail message based on the incremented count values.

In accordance with yet another aspect of the invention, a method for detecting transmission of unwanted e-mail messages is provided. The method includes receiving e-mail messages and detecting unwanted e-mail messages of the received e-mail messages based on hashes of previously received e-mail messages, where multiple hashes are performed on each of the e-mail messages.

In accordance with a further aspect of the invention, a method for detecting transmission of potentially unwanted e-mail messages is provided. The method includes receiving an e-mail message; generating hash values over blocks of the e-mail message, where the blocks include at least two of a main text portion, an attachment portion, and a header portion of the e-mail message; determining whether the generated hash values match hash values associated with prior e-mail messages; and determining that the e-mail message is a potentially unwanted e-mail message when one or more of the generated hash values associated with the e-mail message match one or more of the hash values associated with the prior e-mail messages.

In accordance with another aspect of the invention, a mail server in a network of cooperating mail servers is provided. The mail server includes one or more hash memories and a hash processor. The one or more hash memories is/are configured to store information relating to hash values corresponding to previously-observed e-mails. The hash processor is configured to receive at least some of the hash values from another one or more of the cooperating mail servers and store information relating to the at least some of the hash values in at least one of the one or more hash memories. The hash processor is further configured to receive an e-mail message, hash one or more portions of the received e-mail message to generate hash values, determine whether the generated hash values match the hash values corresponding to previously-observed e-mails, and identify the received e-mail message as a potentially unwanted e-mail message when one or more of the generated hash values associated with the received e-mail message match one or more of the hash values corresponding to previously-observed e-mails.

In accordance with yet another aspect of the invention, a mail server is provided. The mail server includes one or more hash memories and a hash processor. The one or more hash memories is/are configured to store count values associated with hash values. The hash processor is configured to receive e-mail messages, hash one or more portions of the received e-mail messages to generate hash values, increment the count values corresponding to the generated hash values, as incremented count values, and generate suspicion scores for the received e-mail messages based on the incremented count values.

In accordance with a further aspect of the invention, a method for preventing transmission of unwanted e-mail messages is provided. The method includes receiving an e-mail message; generating hash values over portions of the e-mail message as the e-mail message is being received; and incrementally determining whether the generated hash values match hash values associated with prior e-mail messages. The method further includes generating a suspicion score for the e-mail message based on the incremental determining; and rejecting the e-mail message when the suspicion score of the e-mail message is above a threshold.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate the invention and, together with the description, explain the invention. In the drawings,

FIG. 1 is a diagram of a system in which systems and methods consistent with the present invention may be implemented;

FIG. 2 is an exemplary diagram of the e-mail server of FIG. 1 according to an implementation consistent with the principles of the invention;

FIG. 3 is an exemplary functional block diagram of the e-mail server of FIG. 2 according to an implementation consistent with the principles of the invention;

FIG. 4 is an exemplary diagram of the hash processing block of FIG. 3 according to an implementation consistent with the principles of the invention; and

FIGS. 5A-5E are flowcharts of exemplary processing for detecting and/or preventing transmission of an unwanted e-mail message, such as an e-mail containing a virus or worm, including a polymorphic virus or worm, or an unsolicited commercial e-mail, according to an implementation consistent with the principles of the invention.

DETAILED DESCRIPTION

The following detailed description of the invention refers to the accompanying drawings. The same reference numbers in different drawings may identify the same or similar elements. Also, the following detailed description does not limit the invention. Instead, the scope of the invention is defined by the appended claims and equivalents.

Systems and methods consistent with the present invention provide virus, worm, and unsolicited e-mail detection and/or prevention in e-mail servers. Placing these features in e-mail servers provides a number of new advantages, including the ability to align hash blocks to crucial boundaries found in e-mail messages and eliminate certain counter-measures by the attacker, such as using small Internet Protocol (IP) fragments to limit the detectable content in each packet. It also allows these features to relate e-mail header fields with the potentially-harmful segment of the message (usually an “attachment”), and decode common file-packing and encoding formats that might otherwise make a virus or worm undetectable by the packet-based technique (e.g., “.zip files”).

By placing these features within an e-mail server, the ability to detect replicated content in the network at points where large quantities of traffic are present is obtained. By relating many otherwise-independent messages and finding common factors, the e-mail server may detect unknown, as well as known, viruses and worms. These features may also be applied to detect potential unsolicited commercial e-mail (“spam”).

E-mail servers for major Internet Service Providers (ISPs) may process a million e-mail messages a day, or more, in a single server. When viruses and worms are active in the network, a substantial fraction of this e-mail may actually be traffic generated by the virus or worm. Thus, an e-mail server may have dozens to thousands of examples of a single e-mail-borne virus pass through it in a day, offering an excellent opportunity to determine the relationships between e-mail messages and detect replicated content (a feature that is indicative of virus/worm propagation) and spam, among other, more legitimate traffic (such as traffic from legitimate mailing lists).

Systems and methods consistent with the principles of the invention provide mechanisms to detect and stop e-mail-borne viruses and worms before the addressed user receives them, in an environment where the virus is still inert. Current e-mail servers do not normally execute any code in the e-mail being transported, so they are not usually subject to virus/worm infections from the content of the e-mails they process, though they may be subject to infection via other forms of attack.

Besides e-mail-borne viruses and worms, another common problem found in e-mail is mass-e-mailing of unsolicited commercial e-mail, colloquially referred to as “spam.” It is estimated that perhaps 25%-50% of all e-mail messages now received for delivery by major ISP e-mail servers are spam.

Users of network e-mail services are desirous of mechanisms to block e-mail containing viruses or worms from reaching their machines (where the virus or worm may easily do harm before the user realizes its presence). Users are also desirous of mechanisms to block unsolicited commercial e-mail that consumes their time and resources.

Many commercial e-mail services put a limit on each user's e-mail accumulating at the server, and not yet downloaded to the customer's machine. If too much e-mail arrives between times when the user reads his e-mail, additional e-mail is either “bounced” (i.e., returned to the sender's e-mail server) or even simply discarded, both of which events can seriously inconvenience the user. Because the user has no control over arriving e-mail due to e-mail-borne viruses/worms, or spam, it is a relatively common occurrence that the user's e-mail quota overflows due to unwanted and potentially harmful messages. Similarly, the authors of e-mail-borne viruses, as well as senders of spam, have no reason to limit the size of their messages. As a result, these messages are often much larger than legitimate e-mail messages, thereby increasing the risks of such denial of service to the user by overflowing the per-user e-mail quota.

Users are not the only group inconvenienced by spam and e-mail-borne viruses and worms. Because these types of unwanted e-mail can form a substantial fraction, even a majority, of e-mail traffic in the Internet, for extended periods of time, ISPs typically must add extra resources to handle a peak e-mail load that would otherwise be about half as large. This ratio of unwanted-to-legitimate e-mail traffic appears to be growing daily. Systems and methods consistent with the principles of the invention provide mechanisms to detect and discard unwanted e-mail in network e-mail servers.

Exemplary System Configuration

FIG. 1 is a diagram of an exemplary system 100 in which systems and methods consistent with the present invention may be implemented. System 100 includes mail clients 110 connected to a mail server 120 via a network 130. Connections made in system 100 may be via wired, wireless, and/or optical communication paths. While FIG. 1 shows three mail clients 110 and a single mail server 120, there can be more or fewer clients and servers in other implementations consistent with the principles of the invention.

Network 130 may facilitate communication between mail clients 110 and mail server 120. Typically, network 130 may include a collection of network devices, such as routers or switches, that transfer data between mail clients 110 and mail server 120. In an implementation consistent with the present invention, network 130 may take the form of a wide area network, a local area network, an intranet, the Internet, a public telephone network, a different type of network, or a combination of networks.

Mail clients 110 may include personal computers, laptops, personal digital assistants, or other types of wired or wireless devices that are capable of interacting with mail server 120 to receive e-mails. In another implementation, clients 110 may include software operating upon one of these devices. Client 110 may present e-mails to a user via a graphical user interface.

Mail server 120 may include a computer or another device that is capable of providing e-mail services for mail clients 110. In another implementation, server 120 may include software operating upon one of these devices.

FIG. 2 is an exemplary diagram of mail server 120 according to an implementation consistent with the principles of the invention. Server 120 may include bus 210, processor 220, main memory 230, read only memory (ROM) 240, storage device 250, input device 260, output device 270, and communication interface 280. Bus 210 permits communication among the components of server 120.

Processor 220 may include any type of conventional processor or microprocessor that interprets and executes instructions. Main memory 230 may include a random access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by processor 220. ROM 240 may include a conventional ROM device or another type of static storage device that stores static information and instructions for use by processor 220. Storage device 250 may include a magnetic and/or optical recording medium and its corresponding drive.

Input device 260 may include one or more conventional mechanisms that permit an operator to input information to server 120, such as a keyboard, a mouse, a pen, voice recognition and/or biometric mechanisms, etc. Output device 270 may include one or more conventional mechanisms that output information to the operator, such as a display, a printer, a pair of speakers, etc. Communication interface 280 may include any transceiver-like mechanism that enables server 120 to communicate with other devices and/or systems. For example, communication interface 280 may include mechanisms for communicating with another device or system via a network, such as network 130.

As will be described in detail below, server 120, consistent with the present invention, provides e-mail services to clients 110, while detecting unwanted e-mails and/or preventing unwanted e-mails from reaching clients 110. Server 120 may perform these tasks in response to processor 220 executing sequences of instructions contained in, for example, memory 230. These instructions may be read into memory 230 from another computer-readable medium, such as storage device 250 or a carrier wave, or from another device via communication interface 280.

Execution of the sequences of instructions contained in memory 230 may cause processor 220 to perform processes that will be described later. Alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement processes consistent with the present invention. Thus, processes performed by server 120 are not limited to any specific combination of hardware circuitry and software.

FIG. 3 is an exemplary functional block diagram of mail server 120 according to an implementation consistent with the principles of the invention. Server 120 may include a Simple Mail Transfer Protocol (SMTP) block 310, a Post Office Protocol (POP) block 320, an Internet Message Access Protocol (IMAP) block 330, and a hash processing block 340.

SMTP block 310 may permit mail server 120 to communicate with other mail servers connected to network 130 or another network. SMTP is designed to efficiently and reliably transfer e-mail across networks. SMTP defines the interaction between mail servers to facilitate the transfer of e-mail even when the mail servers are implemented on different types of computers or running different operating systems.

POP block 320 may permit mail clients 110 to retrieve e-mail from mail server 120. POP block 320 may be designed to always receive incoming e-mail. POP block 320 may then hold e-mail for mail clients 110 until mail clients 110 connect to download them.

IMAP block 330 may provide another mechanism by which mail clients 110 can retrieve e-mail from mail server 120. IMAP block 330 may permit mail clients 110 to access remote e-mail as if the e-mail were local to mail clients 110.

Hash processing block 340 may interact with SMTP block 310, POP block 320, and/or IMAP block 330 to detect and prevent transmission of unwanted e-mail, such as e-mails containing viruses or worms and unsolicited commercial e-mail (spam).

FIG. 4 is an exemplary diagram of hash processing block 340 according to an implementation consistent with the principles of the invention. Hash processing block 340 may include hash processor 410 and one or more hash memories 420. Hash processor 410 may include a conventional processor, an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or some other type of device that generates one or more representations for each received e-mail and records the e-mail representations in hash memory 420.

An e-mail representation will likely not be a copy of the entire e-mail, but rather it may include a portion of the e-mail or some unique value representative of the e-mail. For example, a fixed width number may be computed across portions of the e-mail in a manner that allows the entire e-mail to be identified.

To further illustrate the use of representations, a 32-bit hash value, or digest, may be computed across portions of each e-mail. Then, the hash value may be stored in hash memory 420 or may be used as an index, or address, into hash memory 420. Using the hash value, or an index derived therefrom, results in efficient use of hash memory 420 while still allowing the content of each e-mail passing through mail server 120 to be identified.
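By way of illustration only, the derivation of an index from a 32-bit hash value may be sketched as follows. The sketch assumes Python, uses CRC-32 as a stand-in hash function, and introduces hypothetical names (`HASH_MEMORY_SIZE`, `hash32`, `memory_index`) and a hypothetical table size that do not appear in this description.

```python
import zlib

# Hypothetical number of storage locations in the hash memory (an assumption).
HASH_MEMORY_SIZE = 1 << 20


def hash32(portion: bytes) -> int:
    """Return a 32-bit digest of an e-mail portion (CRC-32 as a stand-in)."""
    return zlib.crc32(portion) & 0xFFFFFFFF


def memory_index(portion: bytes, size: int = HASH_MEMORY_SIZE) -> int:
    """Reduce the 32-bit hash value to an address within `size` locations."""
    return hash32(portion) % size


if __name__ == "__main__":
    body = b"Example main text of an e-mail message."
    print(hash32(body), memory_index(body))
```

Reducing the hash value modulo the table size is one simple choice of "an index derived therefrom"; other reductions, such as masking the low-order bits, may serve equally well.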

Systems and methods consistent with the present invention may use any storage scheme that records information about one or more portions of each e-mail in a space-efficient fashion, that can definitively determine if a portion of an e-mail has not been observed, and that can respond positively (i.e., in a predictable way) when a portion of an e-mail has been observed. Although systems and methods consistent with the present invention can use virtually any technique for deriving representations of portions of e-mails, the remaining discussion will use hash values as exemplary representations of portions of e-mails received by mail server 120.

In implementations consistent with the principles of the invention, hash processor 410 may hash one or more portions of a received e-mail to produce a hash value used to facilitate hash-based detection. For example, hash processor 410 may hash one or more of the main text within the message body, any attachments, and one or more header fields, such as sender-related fields (e.g., “From:,” “Sender:,” “Reply-To:,” “Return-Path:,” and “Error-To:”). Hash processor 410 may perform one or more hashes on each of the e-mail portions using the same or different hash functions.
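As a non-limiting sketch, the hashing of separate e-mail portions may be illustrated in Python using the standard `email` package. The function name `hash_portions`, the key naming scheme, the use of CRC-32, and the rule that a part carrying a filename is treated as an attachment are all assumptions made for illustration.

```python
import zlib
from email import message_from_bytes

# Sender-related header fields named in the description.
SENDER_FIELDS = ("From", "Sender", "Reply-To", "Return-Path", "Error-To")


def hash_portions(raw: bytes) -> dict:
    """Hash sender-related headers, main text parts, and attachments separately."""
    msg = message_from_bytes(raw)
    hashes = {}
    for name in SENDER_FIELDS:
        value = msg.get(name)
        if value is not None:
            hashes["header:" + name] = zlib.crc32(str(value).encode("utf-8"))
    for i, part in enumerate(msg.walk()):
        if part.is_multipart():
            continue  # container parts have no content of their own
        payload = part.get_payload(decode=True) or b""
        # Assumption: a part that carries a filename is treated as an attachment.
        kind = "attachment" if part.get_filename() else "text"
        hashes[f"{kind}:{i}"] = zlib.crc32(payload)
    return hashes


if __name__ == "__main__":
    sample = b"From: a@example.com\r\nTo: b@example.com\r\n\r\nhello\r\n"
    print(hash_portions(sample))
```

Decoding each part before hashing means that the same attachment yields the same hash value regardless of its transfer encoding.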

As described in more detail below, hash processor 410 may use the hash results of the hash operation to recognize duplicate occurrences of e-mails and raise a warning if the duplicate e-mail occurrences arrive within a short period of time and raise their level of suspicion above some threshold. It may also be possible to use the hash results for tracing the path of an unwanted e-mail through the network.

Each hash value may be determined by taking an input block of data and processing it to obtain a numerical value that represents the given input data. Suitable hash functions are readily known in the art and will not be discussed in detail herein. Examples of hash functions include the Cyclic Redundancy Check (CRC) and Message Digest 5 (MD5). The resulting hash value, also referred to as a message digest or hash digest, may include a fixed length value. The hash value may serve as a signature for the data over which it was computed.

The hash value essentially acts as a fingerprint identifying the input block of data over which it was computed. Unlike fingerprints, however, there is a chance that two very different pieces of data will hash to the same value, resulting in a hash collision. An acceptable hash function should provide a good distribution of values over a variety of data inputs in order to prevent these collisions. Because collisions occur when different input blocks result in the same hash value, an ambiguity may arise when attempting to associate a result with a particular input.

Hash processor 410 may store a representation of each e-mail it observes in hash memory 420. Hash processor 410 may store the actual hash values as the e-mail representations or it may use other techniques for minimizing storage requirements associated with retaining hash values and other information associated therewith. A technique for minimizing storage requirements may use one or more arrays or Bloom filters.

Rather than storing the actual hash value, which can typically be on the order of 32 bits or more in length, hash processor 410 may use the hash value as an index for addressing an array within hash memory 420. In other words, when hash processor 410 generates a hash value for a portion of an e-mail, the hash value serves as the address location into the array. At the address corresponding to the hash value, a count value may be incremented at the respective storage location, thus indicating that a particular hash value, and hence a particular e-mail portion, has been seen by hash processor 410. In one implementation, the count value is associated with an 8-bit counter with a maximum value that sticks at 255. While counter arrays are described by way of example, it will be appreciated by those skilled in the relevant art that other storage techniques may be employed without departing from the spirit of the invention.
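A counter array of this kind may be sketched, purely as an example, as follows. The class name `CounterArray`, the default array size, and the use of CRC-32 reduced modulo the array size are assumptions; the 8-bit counter that sticks at 255 follows the implementation described above.

```python
import zlib


class CounterArray:
    """Array of 8-bit count values addressed by a hash of an e-mail portion."""

    MAX_COUNT = 255  # an 8-bit counter sticks at this value

    def __init__(self, size: int = 1 << 16):  # hypothetical array size
        self.size = size
        self.counts = bytearray(size)  # one byte per storage location

    def observe(self, portion: bytes) -> int:
        """Increment the count for this portion and return the new count value."""
        idx = zlib.crc32(portion) % self.size
        if self.counts[idx] < self.MAX_COUNT:
            self.counts[idx] += 1
        return self.counts[idx]


if __name__ == "__main__":
    memory = CounterArray()
    for _ in range(3):
        seen = memory.observe(b"replicated attachment bytes")
    print(seen)
```

Because only one byte is stored per location, different portions that hash to the same index share a counter, which is the collision ambiguity discussed earlier.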

Hash memory 420 may store a suspicion count that is used to determine the overall suspiciousness of an e-mail message. For example, the count value (described above) may be compared to a threshold, and the suspicion count for the e-mail may be incremented if the threshold is exceeded. Hence, there may be a direct relationship between the count value and the suspicion count, and it may be possible for the two values to be the same. The larger the suspicion count, the more important the hit should be considered in determining the overall suspiciousness of the message. Alternatively, the suspicion count can be combined in a “scoring function” with values from this or other hash blocks in the same message in order to determine whether the message should be considered suspicious.
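One possible reading of the suspicion count and the “scoring function” is sketched below for illustration. The weighted sum, the default threshold of 10, and the names `suspicion_count` and `message_score` are assumptions; the description does not specify a particular scoring function.

```python
def suspicion_count(count_values, threshold=10):
    """Count how many count values exceed the threshold (threshold is assumed)."""
    return sum(1 for count in count_values if count > threshold)


def message_score(counts_by_block, weights, threshold=10):
    """Combine per-block suspicion counts in a weighted sum (an assumed form)."""
    total = 0.0
    for block, count_values in counts_by_block.items():
        weight = weights.get(block, 1.0)  # unlisted blocks get unit weight
        total += weight * suspicion_count(count_values, threshold)
    return total


if __name__ == "__main__":
    counts = {"attachment": [50, 60], "text": [1]}
    print(message_score(counts, {"attachment": 2.0}))
```

Weighting attachment hits more heavily than main-text hits is only an example of how values from different hash blocks might be combined.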

It is not enough, however, for hash memory 420 to simply identify that an e-mail contains content that has been seen recently. There are many legitimate sources (e.g., e-mail list servers) that produce multiple copies of the same message, addressed to multiple recipients. Similarly, individual users often e-mail messages to a group of people and, thus, multiple copies might be seen if several recipients happen to receive their mail from the same server. Also, people often forward copies of received messages to friends or co-workers.

In addition, virus/worm authors typically try to minimize the replicated content in each copy of the virus/worm, in order to not be detected by existing virus and worm detection technology that depends on detecting fixed sequences of bytes in a known virus or worm. These mutable viruses/worms are usually known as polymorphic, and the attacker's goal is to minimize the recognizability of the virus or worm by scrambling each copy in a different way. For the virus or worm to remain viable, however, a small part of it can be mutable in only a relatively small number of ways, because some of its code must be immediately-executable by the victim's computer, and that limits the mutation and obscurement possibilities for the critical initial code part.

In order to accomplish the proper classification of various types of legitimate and unwanted e-mail messages, multiple hash memories 420 can be employed, with separate hash memories 420 being used for specific sub-parts of a standard e-mail message. The outputs of different ones of hash memories 420 can then be combined in an overall “scoring” or classification function to determine whether the message is undesirable or legitimate, and possibly estimate the probability that it belongs to a particular class of traffic, such as a virus/worm message, spam, an e-mail list message, or a normal user-to-user message.

For e-mail following the Internet mail standard RFC 822 (and its various extensions), hashing of certain individual e-mail header fields into field-specific hash memories 420 may be useful. Among the header fields for which this may be helpful are: (1) various sender-related fields, such as “From:”, “Sender:”, “Reply-To:”, “Return-Path:” and “Error-To:”; (2) the “To:” field (often a fixed value for a mailing list, frequently missing or idiosyncratic in spam messages); and (3) the last few “Received:” headers (i.e., the earliest ones, since they are normally added at the top of the message), excluding any obvious timestamp data. It may also be useful to hash a combination of the “From:” field and the e-mail address of the recipient (transferred as part of the SMTP mail-transfer protocol, and not necessarily found in the message itself).
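The hashing of the last few “Received:” headers without timestamp data, and of the “From:” field combined with the recipient address, may be sketched as follows. The sketch assumes that the timestamp is the text after the final semicolon of each “Received:” field, as in the RFC 822 field syntax, and the names `received_hashes`, `sender_recipient_hash`, and `last_n` are hypothetical.

```python
import zlib
from email import message_from_bytes


def received_hashes(raw: bytes, last_n: int = 3) -> list:
    """Hash the last few (earliest) Received: headers without their timestamps."""
    msg = message_from_bytes(raw)
    received = msg.get_all("Received") or []
    hashes = []
    for value in received[-last_n:]:
        # Assumption: the date-time follows the final ';' of the field.
        trace = str(value).rsplit(";", 1)[0]
        normalized = " ".join(trace.split())  # collapse folding whitespace
        hashes.append(zlib.crc32(normalized.encode("utf-8")))
    return hashes


def sender_recipient_hash(from_field: str, recipient: str) -> int:
    """Hash the From: field together with the envelope recipient address."""
    combined = from_field.strip().lower() + "\n" + recipient.strip().lower()
    return zlib.crc32(combined.encode("utf-8"))


if __name__ == "__main__":
    sample = (b"Received: from relay.example.org by mx.example.com; "
              b"Mon, 1 Jan 2001 00:00:00 +0000\r\n"
              b"From: a@example.com\r\n\r\nbody\r\n")
    print(received_hashes(sample))
    print(sender_recipient_hash("a@example.com", "b@example.com"))
```

The recipient address is passed in separately because, as noted above, it is taken from the mail-transfer dialogue rather than from the message itself.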

Any or all of hash memories 420 may be pre-loaded with knowledge of known good or bad traffic. For example, known viruses and spam content (e.g., the infamous “Craig Shergold letter” or many pyramid swindle letters) can be pre-hashed into the relevant hash memories 420, and/or periodically refreshed in the memory as part of a periodic “cleaning” process described below. Also, known legitimate mailing lists, such as mailing lists from legitimate e-mail list servers, can be added to a “From:” hash memory 420 that passes traffic without further examination.

Over time, hash memories 420 may fill up and the possibility of overflowing an existing count value increases. The risk of overflowing a count value may be reduced if the counter arrays are periodically flushed to other storage media, such as a magnetic disk drive, optical media, solid state drive, or the like. Alternatively, the counter arrays may be slowly and incrementally erased. To facilitate this, a time-table may be established for flushing/erasing the counter arrays. If desired, the flushing/erasing cycle can be reduced by computing hash values only for a subset of the e-mails received by mail server 120. While this approach reduces the flushing/erasing cycle, it increases the possibility that a target e-mail may be missed (i.e., a hash value is not computed over a portion of it).

Non-zero storage locations within hash memories 420 may be decremented periodically rather than being erased. This may ensure that the “random noise” from normal e-mail traffic would not remain in a counter array indefinitely. Replicated traffic (e.g., e-mails containing a virus/worm that are propagating repeatedly across the network), however, would normally cause the relevant storage locations to stay substantially above the “background noise” level.

One way to decrement the count values in the counter array fairly is to keep a total count, for each hash memory 420, of every time one of the count values is incremented. After this total count reaches some threshold value (probably in the millions), for every time a count value is incremented in hash memory 420, another count value gets decremented. One way to pick the count value to decrement is to keep a counter, as a decrement pointer, that simply iterates through the storage locations sequentially. Every time a decrement operation is performed, the following may be done: (a) examine the candidate count value to be decremented and if non-zero, decrement it and increment the decrement pointer to the next storage location; and (b) if the candidate count value is zero, then examine each sequentially-following storage location until a non-zero count value is found, decrement that count value, and advance the decrement pointer to the following storage location.
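The decrement-pointer technique described above may be sketched as follows, by way of example only. The class name `AgingCounterArray` and the small sizes used in the usage lines are assumptions, and the sketch skips both the increment and the paired decrement when a counter is already stuck at 255, which is one possible design choice rather than a requirement of the description.

```python
class AgingCounterArray:
    """Counter array that pairs each increment with a decrement once full."""

    def __init__(self, size: int, total_threshold: int):
        self.counts = bytearray(size)           # 8-bit count values
        self.total = 0                          # running total of increments
        self.total_threshold = total_threshold  # population to hold steady
        self.pointer = 0                        # decrement pointer

    def increment(self, idx: int) -> None:
        if self.counts[idx] == 255:
            return  # counter sticks at 255; nothing added, nothing removed
        self.counts[idx] += 1
        self.total += 1
        if self.total > self.total_threshold:
            self._decrement_one()
            self.total -= 1

    def _decrement_one(self) -> None:
        """Decrement the first non-zero count at or after the decrement pointer."""
        size = len(self.counts)
        for step in range(size):
            slot = (self.pointer + step) % size
            if self.counts[slot] > 0:
                self.counts[slot] -= 1
                self.pointer = (slot + 1) % size  # advance past this location
                return


if __name__ == "__main__":
    memory = AgingCounterArray(size=8, total_threshold=5)
    for i in range(20):
        memory.increment(i % 8)
    print(list(memory.counts), sum(memory.counts))
```

In this sketch the sum of all count values never exceeds the threshold, which corresponds to keeping a fixed total count population across the storage locations.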

It may be important to avoid decrementing any counters below zero, while not biasing decrements unfairly. Because it may be assumed that the hash is random, this technique should not favor any particular storage location, since it visits each of them before starting over. This technique may be superior to a timer-based decrement because it keeps a fixed total count population across all of the storage locations, representing the most recent history of traffic, and is not subject to changes in behavior as the volume of traffic varies over time.
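The pointer-based decrement described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of the invention; the class name `AgingCounterArray`, its method names, and the idea of passing the threshold as a constructor argument are assumptions made for readability.

```python
class AgingCounterArray:
    """Counter array that pairs each increment with one decrement
    once more than `threshold` increments have been observed."""

    def __init__(self, size, threshold):
        self.counts = [0] * size
        self.threshold = threshold   # assumed parameter; the text suggests millions
        self.total_increments = 0    # total count of all increments so far
        self.pointer = 0             # decrement pointer into the storage locations

    def increment(self, index):
        self.counts[index] += 1
        self.total_increments += 1
        if self.total_increments > self.threshold:
            self._decrement_one()

    def _decrement_one(self):
        size = len(self.counts)
        # Start at the pointer and scan forward (wrapping around) until a
        # non-zero count value is found, so no counter ever goes below zero.
        for step in range(size):
            candidate = (self.pointer + step) % size
            if self.counts[candidate] > 0:
                self.counts[candidate] -= 1
                # Advance the pointer to the location after the one decremented.
                self.pointer = (candidate + 1) % size
                return
```

In this sketch, once the threshold has been passed, the sum of all count values stays equal to the threshold, which corresponds to the fixed total count population mentioned above.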

A variation of this technique may include randomly selecting a count value to decrement, rather than processing them cyclically. In this variation, if the chosen count value is already zero, then another one could be picked randomly, or the count values in the storage locations following the initially-chosen one could be examined in series, until a non-zero count value is found.
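The random variation might look like the sketch below, which uses the second fallback (examining the following storage locations in series). The function name `decrement_random` and the explicit `rng` argument are assumptions for illustration only.

```python
import random


def decrement_random(counts, rng):
    """Decrement one non-zero count value, starting from a random location.

    Returns the index that was decremented, or None if every count is zero.
    """
    size = len(counts)
    start = rng.randrange(size)          # randomly chosen starting location
    for step in range(size):
        index = (start + step) % size    # serial scan past zero-valued locations
        if counts[index] > 0:
            counts[index] -= 1
            return index
    return None
```

Passing a seeded `random.Random` instance makes the selection reproducible for testing.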

Exemplary Processing for Unwanted E-Mail Detection/Prevention

FIGS. 5A-5E are flowcharts of exemplary processing for detecting and/or preventing transmission of unwanted e-mail, such as an e-mail containing a virus or worm, including a polymorphic virus or worm, or an unsolicited commercial e-mail (spam), according to an implementation consistent with the principles of the invention. The processing of FIGS. 5A-5E will be described in terms of a series of acts that may be performed by mail server 120. In implementations consistent with the principles of the invention, some of the acts may be optional and/or performed in an order different than that described. In other implementations, different acts may be substituted for described acts or added to the process.

Processing may begin when hash processor 410 (FIG. 4) receives, or otherwise observes, an e-mail message (act 502) (FIG. 5A). Hash processor 410 may hash the main text of the message body, excluding any attachments (act 504). When hashing the main text, hash processor 410 may perform one or more conventional hashes covering one or more portions, or all, of the main text. For example, hash processor 410 may perform hash functions on fixed or variable sized blocks of the main text. It may be beneficial for hash processor 410 to perform multiple hashes on each of the blocks using the same or different hash functions.
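As a rough illustration of hashing fixed-size blocks with more than one hash function, consider the sketch below. The choice of SHA-256 and BLAKE2b, the 64-byte block size, and the 20-bit table index are all assumptions; the description does not name specific hash functions or sizes.

```python
import hashlib


def block_hashes(text, block_size=64, table_bits=20):
    """Split `text` (bytes) into fixed-size blocks and return, for each block,
    a pair of table indices produced by two different hash functions."""
    mask = (1 << table_bits) - 1
    result = []
    for offset in range(0, len(text), block_size):
        block = text[offset:offset + block_size]
        # First hash function: leading 8 bytes of SHA-256, reduced to an index.
        h1 = int.from_bytes(hashlib.sha256(block).digest()[:8], "big") & mask
        # Second hash function: 8-byte BLAKE2b digest, reduced to an index.
        h2 = int.from_bytes(
            hashlib.blake2b(block, digest_size=8).digest(), "big") & mask
        result.append((h1, h2))
    return result
```

Identical blocks always map to identical index pairs, which is what allows replicated content to accumulate counts in the same storage locations.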

It may be desirable to pre-process the main text to remove attempts to fool pattern-matching mail filters. An example of this is HyperText Markup Language (HTML) e-mail, where spammers often insert random text strings in HTML comments between or within words of the text. Such e-mail may be referred to as “polymorphic spam” because it attempts to make each message appear unique. This method for evading detection might otherwise defeat the hash detection technique, or other string-matching techniques. Thus, removing all HTML comments from the message before hashing it may be desirable. It might also be useful to delete HTML tags from the message, or apply other specialized, but simple, preprocessing techniques to remove content not actually presented to the user. In general, this may be done in parallel with the hashing of the message text, since viruses and worms may be hidden in the non-visible content of the message text.
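A simple form of this pre-processing could be sketched with regular expressions, as below. The function name `visible_text` is hypothetical, and the patterns are deliberately naive (they do not handle malformed markup or `>` characters inside attribute values); a production filter would more likely use a real HTML parser.

```python
import re

# Matches an HTML comment, including comments that span several lines.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Matches a simple HTML tag such as <b> or </p>.
_TAG = re.compile(r"<[^>]+>")


def visible_text(html):
    """Return the message text with HTML comments and tags removed."""
    without_comments = _COMMENT.sub("", html)
    return _TAG.sub("", without_comments)
```

For example, `visible_text("Buy<!--x9f--> che<!--qq-->ap <b>now</b>")` yields the string `Buy cheap now`, so two messages that differ only in their random comment strings would hash identically. Consistent with the paragraph above, the unmodified text could still be hashed as well, so that content hidden in the non-visible portions is not overlooked.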

Hash processor 410 may also hash any attachments, after first attempting to expand them if they appear to be known types of compressed files (e.g., “zip” files) (act 506). When hashing an attachment, hash processor 410 may perform one or more conventional hashes covering one or more portions, or all, of the attachment. For example, hash processor 410 may perform hash functions on fixed or variable sized blocks of the attachment. It may be beneficial for hash processor 410 to perform multiple hashes on each of the blocks using the same or different hash functions.
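The expansion step might be sketched as follows, using Python's standard `zipfile` module. The function name `attachment_payloads` is an assumption, only the “zip” format is handled, and the sketch omits safeguards (such as limits on expanded size) that a real mail server would need.

```python
import io
import zipfile


def attachment_payloads(data):
    """Return the byte strings to be hashed for one attachment.

    If the attachment looks like a zip archive, each member is expanded and
    returned separately; otherwise the raw attachment is returned unchanged.
    """
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            return [archive.read(name) for name in archive.namelist()]
    return [data]
```

Each returned payload could then be split into blocks and hashed in the same way as the main text.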

Hash processor 410 may compare the main text and attachment hashes with known viruses, worms, or spam content in a hash memory 420 that is pre-loaded with information from known viruses, worms, and spam content (acts 508 and 510). If there are any hits in this hash memory 420, there is a probability that the e-mail message contains a virus or worm or is spam. A known polymorphic virus may have only a small number of hashes that match in this hash memory 420, out of the total number of hash blocks in the message. A non-polymorphic virus may have a very high fraction of the hash blocks hit in hash memory 420. For this reason, storage locations within hash memory 420 that contain entries from polymorphic viruses or worms may be given more weight during the pre-loading process, such as by giving them a high initial suspicion count value.

A high fraction of hits in this hash memory 420 may cause the message to be marked as a probable known virus/worm or spam. In this case, the e-mail message can be sidetracked for remedial action, as described below.

A message with a significant “score” from polymorphic virus/worm hash value hits may or may not be a virus/worm instance, and may be sidetracked for further investigation, or marked as suspicious before forwarding to the recipient. An additional check may also be made to determine the level of suspicion.
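The two signals discussed above (the fraction of hash blocks that hit the pre-loaded memory, and a weighted score that lets a few polymorphic-virus hits stand out) can be illustrated with a small sketch. Modeling the pre-loaded hash memory as a dictionary from hash value to initial suspicion count, and the function name `known_content_score`, are assumptions for illustration.

```python
def known_content_score(block_hashes, known_counts):
    """Compare a message's block hashes against pre-loaded known content.

    block_hashes: list of hash values, one per hash block of the message.
    known_counts: dict mapping a hash value to its pre-loaded suspicion count.
    Returns (fraction of blocks that hit, sum of the suspicion counts hit).
    """
    if not block_hashes:
        return 0.0, 0
    hits = [known_counts[h] for h in block_hashes if h in known_counts]
    return len(hits) / len(block_hashes), sum(hits)
```

With `known = {1: 1, 2: 1, 9: 50}`, where entry 9 stands for a heavily weighted polymorphic-virus block, a message with hashes `[1, 2, 3, 4]` gives a hit fraction of 0.5 and a score of 2, while `[9, 5, 6, 7]` gives a lower fraction of 0.25 but a much higher score of 50.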

For example, hash processor 410 may hash a concatenation of the From and To header fields of the e-mail message (act 512) (FIG. 5B). Hash processor 410 may then check the suspicion counts in hash memories 420 for the hashes of the main text, any attachments, and the concatenated From/To (act 514). Hash processor 410 may determine whether the main text or attachment suspicion count is significantly higher than the From/To suspicion count (act 516). If so, then the content is appearing much more frequently outside the messages between this set of users (which might otherwise be due to an e-mail exchange with repeated message quotations) and, thus, is much more suspicious.
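The comparison in acts 512-516 might be sketched as below. The description does not define “significantly higher,” so the `ratio` parameter and its default of 10 are purely illustrative assumptions, as are the function names and the normalization (trimming and lower-casing) applied to the addresses.

```python
import hashlib


def from_to_hash(from_addr, to_addr):
    """Hash the concatenation of the From and To header field values."""
    key = from_addr.strip().lower() + "\n" + to_addr.strip().lower()
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def content_outpaces_pair(content_count, from_to_count, ratio=10.0):
    """Return True when the content's suspicion count is much higher than the
    count for this particular From/To pair (assumed test: more than `ratio`
    times as high, treating a pair count of zero as one)."""
    return content_count > ratio * max(from_to_count, 1)
```

Under these assumptions, content seen 500 times between a pair of users who have exchanged only 3 messages would be flagged, whereas content seen 20 times between users with 15 prior exchanges (for instance, a thread with repeated quotations) would not.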

When this occurs, hash processor 410 may take remedial action (act 518). The remedial action taken might take different forms, which may be programmable or determined by an operator of mail server 120. For example, hash processor 410 may discard the e-mail. This is not recommended for anything but virtually-certain virus/worm/spam identification, such as a perfect match to a known virus.

As an alternate technique, hash processor 410 may mark the e-mail with a warning in the message body, in an additional header, or other user-visible annotation, and allow the user to deal with it when it is downloaded. For data that appears to be from an unknown mailing list, a variant of this option is to request the user to send back a reply message to the server, classifying the suspect message as either spam or a mailing list. In the latter case, the mailing list source address can be added to the “known legitimate mailing lists” hash memory 420.

As another technique, hash processor 410 may subject the e-mail to more sophisticated (and possibly more resource-consuming) detection algorithms to make a more certain determination. This is recommended for potential unknown viruses/worms or possible detection of a polymorphic virus/worm.

As yet another technique, hash processor 410 may hold the e-mail message in a special area and create a special e-mail message to notify the user of the held message (probably including From and Subject fields). Hash processor 410 may also give instructions on how to retrieve the message.

As a further technique, hash processor 410 may mark the e-mail message with its suspicion score result, but leave it queued for the user's retrieval. If the user's quota would overflow when a new message arrives, the score of the incoming message and the highest score of the queued messages are compared. If the highest queued message has a score above a settable threshold, and the new message's score is lower than the threshold, the queued message with the highest score may be deleted from the queue to make room for the new message. Otherwise, if the new message has a score above the threshold, it may be discarded or “bounced” (e.g., the sending e-mail server is told to hold the message and retry it later). Alternatively, if it is desired to never bounce incoming messages, mail server 120 may accept the incoming message into the user's queue and repeatedly delete messages with the highest suspicion score from the queue until the total is below the user's quota again.
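The never-bounce alternative at the end of the preceding paragraph can be sketched as follows. Representing each queued message as a `(score, size)` tuple, measuring the quota in the same units as `size`, and the function name `accept_and_trim` are assumptions made for the example.

```python
def accept_and_trim(queue, new_message, quota):
    """Accept `new_message` into `queue`, then delete the messages with the
    highest suspicion score until the total size no longer exceeds `quota`.

    queue: list of (score, size) tuples, modified in place.
    Returns the list of messages that were deleted.
    """
    queue.append(new_message)
    deleted = []
    while queue and sum(size for _, size in queue) > quota:
        worst = max(queue, key=lambda message: message[0])
        queue.remove(worst)
        deleted.append(worst)
    return deleted
```

Note that in this sketch the incoming message is itself deleted if it carries the highest suspicion score, so a highly suspicious arrival cannot displace less suspicious queued mail.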

As another technique, hash processor 410 may apply hash-based functions as the e-mail message starts arriving from the sending server and determine the message's suspicion score incrementally as the message is read in. If the message has a high-enough suspicion score (above a threshold) during the early part of the message, mail server 120 may reject the message, optionally with either a “retry later” or a “permanent refusal” result to the sending server (which one is used may be determined by settable thresholds applied to the total suspicion score, and possibly other factors, such as server load). This results in the unwanted e-mail using up less network bandwidth and receiving server resources, and penalizes servers sending unwanted mail, relative to those that do not.
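Incremental scoring of an arriving message could be sketched as below. The caller-supplied `count_for_block` function stands in for a lookup in the hash memories; it, the function name `stream_verdict`, the fixed block size, and the simple additive score are all assumptions rather than details given in the description.

```python
def stream_verdict(chunks, count_for_block, reject_at, block_size=64):
    """Score a message incrementally as its chunks arrive.

    chunks: iterable of bytes objects, in arrival order.
    count_for_block: function mapping one block (bytes) to a suspicion count.
    Returns ("reject", bytes_read) as soon as the running score exceeds
    `reject_at`, otherwise ("accept", bytes_read) after the last chunk.
    """
    pending = b""
    score = 0
    bytes_read = 0
    for chunk in chunks:
        bytes_read += len(chunk)
        pending += chunk
        # Score every complete block that is now available.
        while len(pending) >= block_size:
            block, pending = pending[:block_size], pending[block_size:]
            score += count_for_block(block)
            if score > reject_at:
                return "reject", bytes_read
    if pending:
        # Score the final, possibly short, block.
        score += count_for_block(pending)
        if score > reject_at:
            return "reject", bytes_read
    return "accept", bytes_read
```

Because `chunks` may be a generator reading from the network connection, returning early means the remaining chunks are never requested, which is how the bandwidth saving described above would arise.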

If the suspicion count for the main text or any attachment is not significantly higher than the From/To suspicion count (act 516), hash processor 410 may determine whether the main text or any attachment has significant replicated content (non-zero or high suspicion count values for many hash blocks in the text/attachment content in all storage locations of hash memories 420) (act 520) (FIG. 5A). If not, the message is probably a normal user-to-user e-mail. These types of messages may be “passed” without further examination. When appropriate, hash processor 410 may also record the generated hash values by incrementing the suspicion count value in the corresponding storage locations in hash memory 420.
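Recording a hash value, as described at the end of the preceding paragraph, amounts to incrementing the counter at the storage location the hash selects. The sketch below is one possible reading; the class name `HashMemory`, the use of SHA-256 to choose a slot, the one-byte counters, and the choice to saturate at 255 rather than overflow are assumptions.

```python
import hashlib


class HashMemory:
    """A counter array indexed by the hash of a block of content."""

    MAX_COUNT = 255  # one byte per storage location in this sketch

    def __init__(self, size=1 << 16):
        self.counts = bytearray(size)

    def _slot(self, block):
        digest = hashlib.sha256(block).digest()
        return int.from_bytes(digest[:8], "big") % len(self.counts)

    def count(self, block):
        """Return the suspicion count currently stored for `block`."""
        return self.counts[self._slot(block)]

    def record(self, block):
        """Increment the suspicion count for `block`, saturating at MAX_COUNT."""
        slot = self._slot(block)
        if self.counts[slot] < self.MAX_COUNT:
            self.counts[slot] += 1
```

Only counters are stored, never the blocks themselves, and distinct blocks may share a slot, so a count should be read as an upper bound on how often that exact block has been seen.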

If the message text is substantially replicated (e.g., greater than 90%), hash processor 410 may check one or more portions of the e-mail message against known legitimate mailing lists within hash memory 420 (act 522) (FIG. 5C). For example, hash processor 410 may hash the From or Sender fields of the e-mail message and compare it/them to known legitimate mailing lists within hash memory 420. Hash processor 410 may also determine whether the e-mail actually appears to originate from the correct source for the mailing list by examining, for example, the sequence of Received headers. Hash processor 410 may further examine a combination of the From or Sender fields and the recipient address to determine if the recipient has previously received e-mail from the sender. This is typical for mailing lists, but atypical of unwanted e-mail, which will normally not have access to the actual list of recipients for the mailing list. Failure of this examination may simply pass the message on, but mark it as “suspicious,” since the recipient may simply be a new subscriber to the mailing list, or the mailings may be infrequent enough to not persist in the hash counters between mailings.

If there is a match with a legitimate mailing list (act 524), then the message is probably a legitimate mailing list duplicate and may be passed with no further examination. This assumes that the mailing list server employs some kind of filtering to exclude unwanted e-mail (e.g., refusing to forward e-mail that does not originate with a known list recipient or refusing e-mail with attachments).

If there is no match with any legitimate mailing lists within hash memory 420, hash processor 410 may hash the sender-related fields (e.g., From, Sender, Reply-To) (act 526). Hash processor 410 may then determine the suspicion count for the sender-related hashes in hash memories 420 (act 528).

Hash processor 410 may determine whether the suspicion counts for the sender-related hashes are similar to the suspicion count(s) for the main text hash(es) (act 530) (FIG. 5D). If both From and Sender fields are present, then the Sender field should match with roughly the same suspicion count value as the message body hash. The From field may or may not match. For a legitimate mailing list, it may be a legitimate mailing list that is not in the known legitimate mailing lists hash memory 420 (or in the case where there is no known legitimate mailing lists hash memory 420). If only the From field is present, it should match about as well as the message text for a mailing list. If none of the sender-related fields match as well as the message text, the e-mail message may be considered moderately suspicious (probably spam, with a variable and fictitious From address or the like).

As an additional check, hash processor 410 may hash the concatenation of the sender-related field with the highest suspicion count value and the e-mail recipient's address (act 532). Hash processor 410 may then check the suspicion count for the concatenation in a hash memory 420 used just for this check (act 534). If it matches with a significant suspicion count value (act 536) (FIG. 5E), then the recipient has recently received multiple messages from this source, which makes it probable that it is a mailing list. The e-mail message may then be passed without further examination.

If the message text or attachments are mostly replicated (e.g., greater than 90% of the hash blocks), but with mostly low suspicion count values in hash memory 420 (act 538), then the message is probably a case of a small-scale replication of a single message to multiple recipients. In this case, the e-mail message may then be passed without further examination.

If the message text or attachments contain some significant degree of content replication (say, greater than 50% of the hash blocks) and at least some of the hash values have high suspicion count values in hash memory 420 (act 540), then the message is fairly likely to be a virus/worm or spam. A virus or worm should be considered more likely if the high-count matches are in an attachment. If the highly-replicated content is in the message text, then the message is more likely to be spam, though it is possible that e-mail text employing a scripting language (e.g., JavaScript) might also contain a virus.
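The decisions of acts 538 and 540 might be approximated as in the sketch below. The 90% and 50% figures come from the examples in the text; everything else is an assumption, including the function name `classify`, the `high_count` parameter, the reading of “mostly low” as a median count below `high_count`, and the returned labels.

```python
import statistics


def classify(block_counts, high_count, in_attachment):
    """Classify replicated content from per-block suspicion counts.

    block_counts: suspicion count found for each hash block of the content.
    high_count: assumed cutoff at or above which a count is considered high.
    in_attachment: True if the content is an attachment, False for main text.
    """
    if not block_counts:
        return "no verdict"
    replicated = sum(1 for c in block_counts if c > 0) / len(block_counts)
    mostly_low = statistics.median(block_counts) < high_count
    any_high = any(c >= high_count for c in block_counts)
    # Act 538: mostly replicated, but with mostly low counts.
    if replicated > 0.9 and mostly_low:
        return "small-scale replication"
    # Act 540: significant replication with at least some high counts.
    if replicated > 0.5 and any_high:
        return "probable virus/worm" if in_attachment else "probable spam"
    return "no verdict"
```

The checks are applied in the order in which the acts are described, so content that satisfies act 538 is passed before act 540 is considered; a different ordering or a different definition of “mostly low” would change the outcome for borderline messages.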

If the replication is in the message text, and the suspicion count is substantially higher for the message text than for the From field, the message is likely to be spam (because spammers generally vary the From field to evade simpler spam filters). A similar check can be made for the concatenation of the From and To header fields, except that in this case, it is most suspicious if the From/To hash misses (finds a zero suspicion count), indicating that the sender does not ordinarily send e-mail to that recipient, making it unlikely to be a mailing list, and very likely to be a spammer (because they normally employ random or fictitious From addresses).

In the above cases, hash processor 410 may take remedial action (act 542). The particular type of action taken by hash processor 410 may vary as described above.

CONCLUSION

Systems and methods consistent with the present invention provide mechanisms within an e-mail server to detect and/or prevent transmission of unwanted e-mail, such as e-mail containing viruses or worms, including polymorphic viruses and worms, and unsolicited commercial e-mail (spam).

Implementation of a hash-based detection mechanism in an e-mail server at the e-mail message level provides advantages over a packet-based implementation in a router or other network node device. For example, the entire e-mail message has been re-assembled, both at the packet level (i.e., IP fragment re-assembly) and at the application level (multiple packets into a complete e-mail message). Also, the hashing algorithm can be applied more intelligently to specific parts of the e-mail message (e.g., header fields, message body, and attachments). Attachments that have been compressed for transport (e.g., “.zip” files) can be expanded for inspection. Without doing this, a polymorphic virus could easily hide inside such files with no repeatable hash signature visible at the packet transport level.

With the entire message available for a single pass of the hashing process, packet boundaries and packet fragmentation do not split sequences of bytes that might otherwise provide useful hash signatures. A clever attacker might otherwise obscure a virus/worm attack by causing the IP packets carrying the malicious code to be fragmented into pieces smaller than that for which the hashing process is effective, or by forcing packet breaks in the middle of otherwise-visible fixed sequences of code in the virus/worm. Also, the entire message is likely to be longer than a single packet, thereby reducing the probability of false alarms (possibly due to insufficient hash-block sample size and too few hash blocks per packet) and increasing the probability of correct identification of a virus/worm (more hash blocks will match per message than per packet, since packets will be only parts of the entire message).

Also, fewer hash-block alignment issues arise when the hash blocks can be intelligently aligned with fields of the e-mail message, such as the start of the message body, or the start of an attachment block. This results in faster detection of duplicate contents than if the blocks are randomly aligned (as is the case when the method is applied to individual packets).

Email-borne malicious code, such as viruses and worms, also usually includes a text message designed to cause the user to read the message and/or perform some other action that will activate the malicious code. It is harder for such text to be polymorphic, because automatic scrambling of the user-visible text will either render it suspicious-looking, or will be very limited in variability. This fact, combined with the ability to start a hash block at the start of the message text by parsing the e-mail header, reduces the variability in hash signatures of the message, making it easier to detect with fewer examples seen.

Further, the ability to extract and hash specific headers from an e-mail message separately may be used to help classify the type of replicated content the message body carries. Because many legitimate cases of message replication exist (e.g., topical mailing lists, such as Yahoo Groups), intelligent parsing and hashing of the message headers is very useful to reduce the false alarm rate, and to increase the accuracy of detection of real viruses, worms, and spam.

This detection technique, compared to others which might extract and save fixed strings to be searched for in other pieces of e-mail, includes hash-based filters that are one-way functions (i.e., it is possible, given a piece of text, to determine if it has been seen before in another message). Given the state data contained in the filter, however, it is virtually impossible to reconstruct a prior message, or any piece of a prior message, that has been passed through the filter previously. Thus, this technique can maintain the privacy of e-mail, without retaining any information that can be attributed to a specific sender or receiver.

The foregoing description of preferred embodiments of the present invention provides illustration and description, but is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention.

For example, systems and methods have been described with regard to a mail server. In other implementations, the systems and methods described herein may be used within other devices, such as a mail client. In such a case, the mail client may periodically obtain suspicion count values for its hash memory from one or more network devices, such as a mail server.

It may be possible for multiple mail servers to work together to detect and prevent unwanted e-mails. For example, high-scoring entries from the hash memory of one mail server might be distributed to other mail servers, as long as the same hash functions are used by the same cooperating servers. This may accelerate the detection process, especially for mail servers that experience relatively low volumes of traffic.
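One way such sharing might work is sketched below, with each hash memory modeled as a plain list of count values indexed identically on every server (which presumes the shared hash functions noted above). The function names, the `min_count` cutoff, and the choice to merge by taking the larger of the local and remote counts are assumptions; the description does not specify how received entries are combined with local ones.

```python
def export_high_scores(counts, min_count):
    """Return the high-scoring entries of a hash memory as {index: count}."""
    return {index: count for index, count in enumerate(counts)
            if count >= min_count}


def merge_remote(counts, remote_entries):
    """Merge entries received from another server into the local hash memory,
    keeping the larger of the local and remote count for each location."""
    for index, count in remote_entries.items():
        counts[index] = max(counts[index], count)
```

In this sketch a low-volume server that has seen a given block only 5 times would, after merging an entry of 120 from a busier server, treat that block as highly suspicious immediately.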

Further, certain portions of the invention have been described as “blocks” that perform one or more functions. These blocks may include hardware, such as an ASIC or an FPGA, software, or a combination of hardware and software.

No element, act, or instruction used in the description of the present application should be construed as critical or essential to the invention unless explicitly described as such. Also, as used herein, the article “a” is intended to include one or more items. Where only one item is intended, the term “one” or similar language is used. The scope of the invention is defined by the claims and their equivalents.

1. In a network of cooperating mail servers, one of the mail servers comprising: one or more hash memories configured to store information relating to hash values corresponding to previously-observed e-mails; and a hash processor configured to: receive at least some of the hash values from another one or more of the cooperating mail servers, store information relating to the at least some of the hash values in at least one of the one or more hash memories, receive an e-mail message, hash one or more portions of the received e-mail message to generate hash values, as generated hash values, determine whether the generated hash values match the hash values corresponding to previously-observed e-mails, and identify the received e-mail message as a potentially unwanted e-mail message when one or more of the generated hash values associated with the received e-mail message match one or more of the hash values corresponding to previously-observed e-mails.